Exploratory Data Analysis (EDA) in Python
Introduction
Exploratory Data Analysis of the Pima Indians Diabetes dataset
The features present in the dataset are:
Pregnancies: number of pregnancies
Glucose: plasma glucose concentration
Blood Pressure: diastolic blood pressure (mm Hg)
Skin Thickness: triceps skin fold thickness (mm)
Insulin
BMI (Body Mass Index)
Diabetes Pedigree Function: a function that scores the likelihood of diabetes based on family history
Age
The target "Outcome" has two classes: 0 (non-diabetic) and 1 (diabetic).
Initial EDA How: using powerful Python modules (pandas, matplotlib, and seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Load the dataset
import pandas as pd
# Column names, in the order they appear in the file
column_names = [
    'Pregnancies',               # Number of pregnancies
    'Glucose',                   # Glucose concentration after 2 hours
    'BloodPressure',             # Diastolic blood pressure (mm Hg)
    'SkinThickness',             # Triceps skin fold thickness (mm)
    'Insulin',                   # Insulin concentration after 2 hours (mu U/ml)
    'BMI',                       # Body mass index (kg/m²)
    'DiabetesPedigreeFunction',  # Diabetes pedigree function
    'Age',                       # Age (years)
    'Outcome'                    # Class variable: 0 (non-diabetic) or 1 (diabetic)
]
# Read the CSV file, which has no header row
df = pd.read_csv('/content/drive/MyDrive/Data_Analysis/pima_indians_diabetes/pima-indians-diabetes.data.csv', header=None, names=column_names)
# Save a copy that includes the header row
df.to_csv('/content/drive/MyDrive/Data_Analysis/pima_indians_diabetes/pima_diabetes_header.csv', index=False)
# Reload the copy with headers; reading the raw file again without header=None
# would turn its first data row into column names
df = pd.read_csv('/content/drive/MyDrive/Data_Analysis/pima_indians_diabetes/pima_diabetes_header.csv')
A first look and non-graphical EDA
df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
df.shape
(768, 9)
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
df.isnull().sum()
| | 0 |
|---|---|
| Pregnancies | 0 |
| Glucose | 0 |
| BloodPressure | 0 |
| SkinThickness | 0 |
| Insulin | 0 |
| BMI | 0 |
| DiabetesPedigreeFunction | 0 |
| Age | 0 |
| Outcome | 0 |
df.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Are the classes (Outcome) balanced?
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
# Check for zero values that are physiologically implausible
cols_to_check = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
(df[cols_to_check] == 0).sum()
| | 0 |
|---|---|
| Glucose | 5 |
| BloodPressure | 35 |
| SkinThickness | 227 |
| Insulin | 374 |
| BMI | 11 |
# Impute the implausible values with the median
cols_to_fix = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Replace each zero with the column median (computed from the non-zero values only)
for col in cols_to_fix:
    median = df[col][df[col] != 0].median()
    df[col] = df[col].replace(0, median)
# Re-check after imputation
print((df[cols_to_fix] == 0).sum())
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
dtype: int64
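Replacing zeros in place hides how many values were imputed. As a minimal alternative sketch, the snippet below uses a made-up four-row frame (`toy` and its values are illustrative assumptions, not rows from the dataset). It first converts zeros to NaN so the missing counts stay visible, then fills them with the column median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from the Pima dataset.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
})

# Treat zeros as missing so that isnull() reports them.
toy_nan = toy.replace(0, np.nan)
missing_counts = toy_nan.isnull().sum()
print(missing_counts)

# median() skips NaN by default, so this is the median of the non-zero values.
toy_filled = toy_nan.fillna(toy_nan.median())
print(toy_filled)
```

Keeping the NaN step separate makes it easy to report the share of imputed values per column before deciding whether median filling is acceptable for that feature.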
df['Outcome'].value_counts()
| Outcome | count |
|---|---|
| 0 | 500 |
| 1 | 268 |
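The raw counts are easier to interpret as proportions. The sketch below rebuilds the two counts shown above as a Series (the variable names are illustrative) and derives the class shares plus the accuracy a trivial majority-class predictor would reach:

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({0: 500, 1: 268}, name="count")

# Share of each class in the dataset.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the full DataFrame, `df['Outcome'].value_counts(normalize=True)` gives the same proportions directly.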
Correlation Between Variables
import numpy as np
import pandas as pd
# Check the label distribution
labels, counts = np.unique(df["Outcome"], return_counts=True)
print(dict(zip(labels, counts)))
# Compute the correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
{np.int64(0): np.int64(500), np.int64(1): np.int64(268)}
Pregnancies Glucose BloodPressure SkinThickness \
Pregnancies 1.000000 0.128213 0.208615 0.081770
Glucose 0.128213 1.000000 0.218937 0.192615
BloodPressure 0.208615 0.218937 1.000000 0.191892
SkinThickness 0.081770 0.192615 0.191892 1.000000
Insulin 0.025047 0.419451 0.045363 0.155610
BMI 0.021559 0.231049 0.281257 0.543205
DiabetesPedigreeFunction -0.033523 0.137327 -0.002378 0.102188
Age 0.544341 0.266909 0.324915 0.126107
Outcome 0.221898 0.492782 0.165723 0.214873
Insulin BMI DiabetesPedigreeFunction \
Pregnancies 0.025047 0.021559 -0.033523
Glucose 0.419451 0.231049 0.137327
BloodPressure 0.045363 0.281257 -0.002378
SkinThickness 0.155610 0.543205 0.102188
Insulin 1.000000 0.180241 0.126503
BMI 0.180241 1.000000 0.153438
DiabetesPedigreeFunction 0.126503 0.153438 1.000000
Age 0.097101 0.025597 0.033561
Outcome 0.203790 0.312038 0.173844
Age Outcome
Pregnancies 0.544341 0.221898
Glucose 0.266909 0.492782
BloodPressure 0.324915 0.165723
SkinThickness 0.126107 0.214873
Insulin 0.097101 0.203790
BMI 0.025597 0.312038
DiabetesPedigreeFunction 0.033561 0.173844
Age 1.000000 0.238356
Outcome 0.238356 1.000000
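The wrapped matrix is hard to scan, so it can help to look only at the Outcome column and sort it. The sketch below copies those correlations from the printed output into a Series (the names `corr_with_outcome` and `ranked` are illustrative):

```python
import pandas as pd

# Correlations with Outcome, copied from the matrix printed above.
corr_with_outcome = pd.Series({
    "Pregnancies": 0.221898,
    "Glucose": 0.492782,
    "BloodPressure": 0.165723,
    "SkinThickness": 0.214873,
    "Insulin": 0.203790,
    "BMI": 0.312038,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
})

# Strongest linear association with the target first.
ranked = corr_with_outcome.sort_values(ascending=False)
print(ranked)

# On the full DataFrame the same ranking would come from:
# df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
```

Glucose (0.49) and BMI (0.31) lead the ranking. These are Pearson coefficients, so they describe linear association only.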
Graphical EDA
Let's pick up where the non-graphical EDA left off.
Are the classes (Outcome) balanced?
#Using pandas
df["Outcome"].value_counts().plot.bar()
plt.savefig("Outcome_hist.png")
#Using matplotlib
counts = df["Outcome"].value_counts()
plt.bar(counts.index, counts.values)
<BarContainer object of 2 artists>
plt.hist(df['Outcome'])
(array([500., 0., 0., 0., 0., 0., 0., 0., 0., 268.]), array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]), <BarContainer object of 10 artists>)
#Using seaborn
sns.countplot(x = df['Outcome']);
What we can conclude:
Outcome distribution: The dataset consists of roughly 65% samples with Outcome = 0 (non-diabetic) and 35% with Outcome = 1 (diabetic), i.e. 500 versus 268 of 768 rows. The imbalance is moderate, so Precision, Recall, F1-score, and ROC AUC can be used without data balancing techniques (e.g., oversampling or undersampling). Accuracy should be read against the majority-class baseline: a model that always predicts 0 already scores about 65%.
Data preprocessing: Physiologically implausible zero values in Glucose, BloodPressure, SkinThickness, Insulin, and BMI were replaced with the median of the non-zero values in each column. This removes the invalid values, but it also concentrates many rows at a single value: 374 of 768 Insulin entries and 227 SkinThickness entries now equal the median, which narrows the spread of those two features.
Significance: After replacement the dataset no longer contains impossible measurements, which gives a more reasonable basis for model training. Whether this improves diabetes prediction has to be checked by comparing models trained with and without the imputation.
# Use accuracy to measure the share of correct diabetes predictions
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Example: load the Pima Diabetes dataset (CSV)
# Note: this file was written before the zero values were imputed,
# so the model below is trained on the raw data, zeros included
df = pd.read_csv("/content/drive/MyDrive/Data_Analysis/pima_indians_diabetes/pima_diabetes_header.csv")
# Separate features and labels
X = df.drop("Outcome", axis=1)
y = df["Outcome"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Compute accuracy and the confusion matrix
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print(f"Accuracy: {accuracy:.4f}")
Confusion Matrix:
 [[78 21]
 [18 37]]
Accuracy: 0.7468
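Accuracy alone hides how the errors are split between the two classes. As a worked example, the sketch below takes the confusion matrix printed above and derives precision, recall, and F1 for the diabetic class by hand, along with the accuracy of always predicting class 0 on this test set. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix printed above; rows = true class, columns = predicted class.
cm = np.array([[78, 21],
               [18, 37]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted diabetics, how many are diabetic
recall = tp / (tp + fn)      # of actual diabetics, how many were found
f1 = 2 * precision * recall / (precision + recall)
baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={baseline:.4f}")
```

The model reaches 0.7468 accuracy against a baseline of about 0.6429, and it misses 18 of the 55 diabetic patients in the test set (recall of about 0.67).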
Comparison graphs
Scatterplots
What?
- A scatterplot uses the Cartesian coordinate system (x-y axes) to display the values of two variables in a dataset.
Why?
- Scatterplots help visualize the relationship between two variables, for example: does BMI increase as Glucose increases?
Very useful for:
- Detecting trends
- Checking correlation
- Identifying outliers
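A scatterplot shows a trend visually, and a Pearson coefficient puts a number on it. The sketch below uses synthetic data (the column names `x` and `y`, the seed, and the distribution parameters are all assumptions for illustration, not the Pima data) and computes the coefficient with both pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not drawn from the Pima dataset.
rng = np.random.default_rng(0)
x = rng.normal(120, 30, size=200)
y = 0.1 * x + rng.normal(20, 5, size=200)  # linear trend plus noise
demo = pd.DataFrame({"x": x, "y": y})

# Series.corr uses Pearson correlation by default.
r = demo["x"].corr(demo["y"])

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
r_np = np.corrcoef(demo["x"], demo["y"])[0, 1]

print(f"Pearson r = {r:.3f}")
```

For the notebook's data the equivalent call would be `df['Glucose'].corr(df['BMI'])`, which should match the Glucose/BMI entry of the correlation matrix.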
#Scatterplot using pandas
df.plot.scatter('Glucose','BMI')
<Axes: xlabel='Glucose', ylabel='BMI'>
#Scatterplot using matplotlib
plt.scatter('Glucose', 'BMI', data=df)
<matplotlib.collections.PathCollection at 0x7814c79b21e0>
plt.scatter(df['Glucose'], df['BMI'])
<matplotlib.collections.PathCollection at 0x7814c54eb7a0>
plt.plot('Glucose', 'BMI', data=df, marker='.',linestyle='none')
[<matplotlib.lines.Line2D at 0x7814c539a540>]
#Scatterplot using seaborn
sns.scatterplot(x='Glucose', y='BMI', data=df)
<Axes: xlabel='Glucose', ylabel='BMI'>
sns.scatterplot(x = 'Glucose', y = 'BMI', hue=df['Outcome'], data=df)
<Axes: xlabel='Glucose', ylabel='BMI'>
What we can conclude:
Patients with diabetes (Outcome = 1) tend to have higher glucose levels and BMI than non-diabetic patients, and the correlation matrix above also shows a positive association between Outcome and Age (0.24).
Non-diabetic cases (Outcome = 0) usually show lower glucose and BMI values, and often fall in the younger age range.
Some features such as Insulin and SkinThickness show large variation but still indicate higher median values in diabetic cases.
As seen above, different visualization tools could be used; from now on, only seaborn plots are demonstrated, while others are left for you to explore.
Correlogram
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7814c52dac30>
sns.pairplot(df,hue="Outcome")
<seaborn.axisgrid.PairGrid at 0x7814bca79a60>
What we can conclude:
Two loosely separated groups can be observed when comparing glucose and BMI: diabetic patients generally lie in the higher range, while non-diabetic patients cluster in the lower range, although the groups overlap.
Patients with diabetes (Outcome = 1) tend to have higher glucose levels, BMI, insulin, and age.
Non-diabetic patients (Outcome = 0) mostly show lower, more normal glucose and BMI values and are relatively younger.
Features such as BloodPressure and SkinThickness separate the groups less clearly; their distributions look similar across both outcomes.
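Visual impressions from a pairplot can be checked with per-class summary statistics. The sketch below uses a hypothetical six-row frame (`toy` and its values are illustrative assumptions, not the full dataset) to show the pattern:

```python
import pandas as pd

# Hypothetical six-row frame; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [85, 89, 110, 148, 183, 137],
    "BMI": [26.6, 28.1, 30.0, 33.6, 23.3, 43.1],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Median of every feature within each Outcome class.
group_medians = toy.groupby("Outcome").median()
print(group_medians)
```

Applied to the notebook's DataFrame, `df.groupby('Outcome').median()` would give one row of medians per class, which makes statements such as "higher glucose in diabetic patients" directly verifiable.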
Heatmap
fig = plt.figure(figsize = (15,9))
sns.heatmap(df.corr(), cmap='Blues', annot = True);
df.hist();
sns.boxplot(x='Outcome', y='Glucose', data=df)
<Axes: xlabel='Outcome', ylabel='Glucose'>
df.boxplot(by="Outcome", figsize=(12, 8));
sns.violinplot(x='Outcome', y='Glucose', data=df)
<Axes: xlabel='Outcome', ylabel='Glucose'>
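The points that a boxplot draws beyond its whiskers follow, by default in matplotlib and seaborn, the 1.5 × IQR rule. The sketch below applies that rule by hand to a small hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical values for illustration only.
values = pd.Series([85, 89, 99, 117, 120, 137, 140, 148, 183, 400])

# Quartiles and interquartile range.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences used by the default boxplot whisker rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are the ones drawn individually.
outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers.tolist())
```

Running the same steps on `df['Glucose']` within each Outcome class would list the points that the boxplots above flag individually.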